zap: TinyZAP for multi-uint64 entries. #18568
|
@behlendorf @robn ^^ Please let me know your thoughts on this. |
e87779e to
ccf54f6
Compare
behlendorf
left a comment
This is shaping up nicely!
| # | ||
| # Copyright (c) 2013, 2014 by Delphix. All rights reserved. | ||
| # Copyright 2016 Nexenta Systems, Inc. All rights reserved. | ||
| # Copyright (c) 2026, Hewlett Packard Enterprise Development LP. |
Rob's proposed unit test framework in #18564 would be an ideal way to exercise the new TinyZAP code. In addition to basic unit tests (add/remove/lookup) we can verify the various promotion paths behave as intended (MicroZAP -> TinyZAP -> FatZAP, etc).
Since the full suite isn't upstreamed yet, I'm intending to wire what I have up to this PR and see what falls out. I'll share the results soon.
But also, if this PR lands before the test suite does, I'll be sure to include coverage in the test suite. You get grandfathered in 👴
Cool, I'll check it out.
I think I'll gradually collect the reviews and make the required changes for this patch, and we can see how things go from there.
robn
left a comment
I'm totally on-board with the idea, but this all seems very convoluted to me.
If I'm understanding all this correctly, a MicroZAP is effectively the same as a TinyZAP with chunk=6 (64B) and stride=1 (8B)?
If so, I'd suggest the code would be a lot nicer if the entire implementation were about TinyZAPs (by structure if not by name), with just a special case for MicroZAPs: if we don't see MZAP_FLAG_TINY, then use chunk=6, stride=1 and do the extra MZAP_NAME_LEN check in the add and upgrade paths.
If you fold all those checks and math into a small number of macros or inline functions (which you basically already have), then it seems like this PR should be almost entirely a mechanical conversion, plus the feature flag handling code.
Because of this, my review comments are either small style nits, or design queries that I think would apply regardless of the structure. Whichever way it goes, I'll need another review round.
|
Changes in my latest push:
Things to be discussed:? |
8147b1a to
5ef112e
Compare
|
Sorry if this was already mentioned, but I suppose this feature will be not only read-incompatible, but also send/receive-incompatible with older receivers. While I have also been thinking about more efficient ZAP formats for BRT/DDT, a read-incompatible feature means we need to update boot loaders for all OSes and add more feature flags to replication streams. |
b775843 to
c8fd312
Compare
MicroZAP is limited to 1×uint64 values and 49-char keys; any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g. Lustre FIDs) and long keys.

Introduce TinyZAP, a MicroZAP variant that reuses mzap_phys_t, repurposing the padding bytes after mz_normflags as three independent uint8_t fields:

  mz_flags        bit 0 = MZAP_FLAG_TINY
  mz_chunk_shift  log2(chunk): 6=64B, 7=128B, 8=256B
  mz_value_ints   stride / 8 (number of uint64 values per entry)

Geometry is stamped automatically on the first zap_add() based on the observed entry shape; no create-time hint is required. Subsequent adds must match the stamped geometry or a FatZAP upgrade is triggered.

All ZAP operations (add, update, remove, lookup, cursor, byteswap, upgrade to FatZAP) dispatch to TinyZAP paths when zap_stride != 0.

Signed-off-by: Akash B <akash-b@hpe.com>
|
Updated the PR description, added new tests, fixed the chunk upgrade paths, and addressed the "TinyZAP by structure" comments. Ready for another round of review? |
|
@adilger @tim-day-387 it'd be great if you could take a look at this. I want to make sure we really understand and address Lustre's current ZAP needs and if possible anything related you might be thinking about longer term! |
|
@behlendorf, for future expansion usage by Lustre Metadata Redundancy we have recently expanded the ldiskfs "dirdata" feature to allow storing multiple 16-byte FIDs into a single directory entry to reference multiple inode mirrors, similar to how ZFS dnodes can reference up to 3 block pointers. I can't comment on the details of the implementation, but from the commit message comments it appears that this TinyZap implementation will allow this to work for ZFS as well. There may be some transition time where pre-existing directory ZAPs are not able to create new entries with multiple FIDs since the TinyZAP geometry is fixed by the first entry created in it, but that should be a relatively uncommon configuration. |
Introduce TinyZAP, a new on-disk ZAP format between MicroZAP and FatZAP. MicroZAP is limited to 1xuint64 values and 49-char keys; any wider entry forces a full FatZAP upgrade. TinyZAP avoids this for the common case of multi-integer values (e.g., Lustre FIDs) and long key names.
Signed-off-by: Akash B <akash-b@hpe.com>
Motivation and Context
This PR introduces TinyZAP, a new on-disk ZAP format that sits between MicroZAP and FatZAP in the ZAP format hierarchy. TinyZAP extends MicroZAP to efficiently handle multi-word values and long key names without the overhead of a full FatZAP upgrade.
The primary motivation is workloads like Lustre that store multi-integer values (e.g., FIDs: 2-3 x uint64_t) or long filenames in ZAP objects. Previously, these always created a FatZAP, consuming significantly more on-disk space and memory than necessary.
ZAP Format Hierarchy (After This Change)
MicroZAP -> TinyZAP -> FatZAP
Internally, however, TinyZAP is still a MicroZAP in an extended format: an in-place extension that supports multi-integer values and longer key names without upgrading to FatZAP.
Description
TinyZAP reuses the existing mzap_phys_t block format.
Three previously reserved bytes in the 64-byte header are repurposed as independent uint8_t fields:
When mz_flags == 0, the block is a plain MicroZAP. When MZAP_FLAG_TINY (bit 0) is set, the TinyZAP layout applies. Each chunk slot is a tzap_ent_phys_t:
Supported chunk sizes and resulting geometry (examples only).
Note: stride=8 with chunk=64 is skipped by tzap_try_promote() because it provides only 2 bytes more than MicroZAP. Chunk=128 is the minimum for stride=8. Chunk=64 is only used when stride >= 16 (num_integers > 1).
Other details on ZAP upgrade conditions:
MicroZAP -> TinyZAP Conditions:
Promotion is attempted automatically on the first zap_add() when the entry fails the plain MicroZAP constraints. All of the following must hold:
The stride is stamped once on the first qualifying add and cannot change. The smallest fitting chunk is selected automatically.
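As a rough illustration of "the smallest fitting chunk is selected automatically", here is a minimal sketch of the selection logic. It is not the code from this PR: the `sketch_` names are hypothetical, and the 4-byte per-entry overhead is an assumption inferred from the note that stride=8 with chunk=64 gives only 2 bytes more name space than MicroZAP's 50-byte name field. The caller is assumed to invoke it only after the entry has already failed the plain MicroZAP constraints.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: bytes per entry that hold neither the value nor the name.
 * Inferred from the description, not taken from the patch.
 */
#define	SKETCH_TZAP_ENT_OVERHEAD	4

/*
 * Bytes available for the NUL-terminated name in one chunk, or 0 if
 * the value alone does not leave room for a name.
 */
static size_t
sketch_tzap_name_len(unsigned chunk_shift, unsigned value_ints)
{
	assert(chunk_shift >= 6 && chunk_shift <= 8);

	size_t chunk = (size_t)1 << chunk_shift;	/* 64, 128 or 256 */
	size_t stride = (size_t)value_ints * 8;		/* value bytes */

	if (stride + SKETCH_TZAP_ENT_OVERHEAD >= chunk)
		return (0);
	return (chunk - stride - SKETCH_TZAP_ENT_OVERHEAD);
}

/*
 * Return the smallest chunk shift (6, 7 or 8) that fits the entry,
 * or 0 if no TinyZAP geometry fits. key_len excludes the trailing NUL.
 */
static unsigned
sketch_tzap_pick_chunk_shift(size_t key_len, unsigned integer_size,
    unsigned num_integers)
{
	if (integer_size != 8 || num_integers == 0)
		return (0);

	for (unsigned shift = 6; shift <= 8; shift++) {
		/* stride=8 with chunk=64 gains too little over MicroZAP. */
		if (shift == 6 && num_integers == 1)
			continue;
		if (key_len + 1 <= sketch_tzap_name_len(shift, num_integers))
			return (shift);
	}
	return (0);
}
```

Under these assumptions, a 10-character key with a two-integer value lands in a 64-byte chunk, while a 60-character key with a single integer needs a 128-byte chunk.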
For stride=8, promotion is also allowed on a populated MicroZAP: existing entries are re-encoded in-place via tzap_reencode_micro_to_tiny(), which re-packs the fixed 64-byte MicroZAP slots into the wider TinyZAP chunk format using a buffer.
TinyZAP Chunk Upgrade (in-place, stays TinyZAP)
When a new key is too long for TZAP_NAME_LEN(chunk, stride) but fits a larger chunk size, tzap_try_chunk_upgrade() re-packs all entries into the new chunk size without upgrading to FatZAP. The chunk can grow from 64->128 or 128->256. The block is grown if needed (up to zap_micro_max_size).
TinyZAP -> FatZAP Conditions:
A FatZAP upgrade is forced when any of the following occur:
During mzap_upgrade(), existing TinyZAP entries are re-encoded into FatZAP leaf blocks via tzap_upgrade_entries(). This function reads the original block size from the sz snapshot taken before fzap_upgrade() changes db_size to 16KB, preventing iteration over ghost slots.
Plain MicroZAP -> FatZAP (Unchanged)
If TinyZAP promotion fails (no fitting chunk, integer_size != 8, geometry mismatch), the existing MicroZAP -> FatZAP path is taken.
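The upgrade paths described above, for an add against an already stamped TinyZAP, can be summarized as a small decision function. This is only a sketch under stated assumptions: the `sketch_` names and the 4-byte entry overhead are hypothetical, it reports only that some larger supported chunk fits (not whether the growth happens in one step or two), and it does not model a full block or the zap_micro_max_size limit.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
	SKETCH_ADD_IN_PLACE,	/* entry fits the stamped geometry */
	SKETCH_CHUNK_UPGRADE,	/* re-pack into a larger chunk, stay TinyZAP */
	SKETCH_FATZAP_UPGRADE	/* fall back to a FatZAP upgrade */
} sketch_add_action_t;

/* Assumption: per-entry bytes that hold neither value nor name. */
#define	SKETCH_ENT_OVERHEAD	4

/*
 * Classify an add against a TinyZAP stamped with (chunk_shift,
 * value_ints). key_len excludes the trailing NUL.
 */
static sketch_add_action_t
sketch_tzap_classify_add(unsigned chunk_shift, unsigned value_ints,
    size_t key_len, unsigned integer_size, unsigned num_integers)
{
	assert(chunk_shift >= 6 && chunk_shift <= 8);

	/* Geometry mismatch: the stride is stamped once and never changes. */
	if (integer_size != 8 || num_integers != value_ints)
		return (SKETCH_FATZAP_UPGRADE);

	size_t used = (size_t)value_ints * 8 + SKETCH_ENT_OVERHEAD +
	    key_len + 1;

	for (unsigned shift = chunk_shift; shift <= 8; shift++) {
		if (used <= ((size_t)1 << shift)) {
			return (shift == chunk_shift ?
			    SKETCH_ADD_IN_PLACE : SKETCH_CHUNK_UPGRADE);
		}
	}
	/* Even a 256-byte chunk is too small for this key. */
	return (SKETCH_FATZAP_UPGRADE);
}
```

For example, with a stamped geometry of chunk=64 and two integers per value, a 10-character key is added in place, a 60-character key triggers a chunk upgrade, and a three-integer value forces the FatZAP path.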
others:
SPA Feature Flag: com.hpe:tinyzap
A new pool feature, SPA_FEATURE_TINYZAP (com.hpe:tinyzap), is introduced:
ZFEATURE_FLAG_MOS) It was decided that the MOS entries need not use TinyZAP.
How Has This Been Tested?
Added simple tests:
Functional test suite which tests or exercises MicroZAP->TinyZAP, chunk upgrade, TinyZAP->FatZAP, remount, readdir, collision, feature flag, etc.
Before the patch using Lustre (FatZap):
Performance:
Total space taken by 1.25 million directories: Total Size: 117.7G (98.86 KB/Inode)
After this patch using Lustre (TinyZap):
Total space taken by 1.25 million directories: Total Size: 3.7G (3.09 KB/Inode)
Performance:
Summary of the overall results:
For draid2:9d:12c:1s-0 (flash MDT and 4 OSTs):
Directory creation improved by +176% - over 2.75x faster, exceeding 100K ops/sec
Directory removal improved by +141% - over 2.4x faster, exceeding 145K ops/sec
Directory stat improved by +97% at peak - nearly 2x faster, approaching 300K ops/sec
Space Efficiency:
Almost 99% reduction in the space used by empty directories: TinyZAP (1-2 KB) vs. FatZAP (100-130 KB).
For 1.25 million directories (TinyZAP: 3.7G (3.09 KB/Inode) vs. FatZAP: 117.7G (98.86 KB/Inode)), ~32x reduction
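The space figures above can be sanity-checked with a few lines of arithmetic. The helpers below are a hypothetical sketch, assuming "G" means GiB and "KB" means KiB in the reported numbers. Small differences from the quoted per-inode values are expected because the totals are rounded.

```c
#include <assert.h>

/* Average space per inode in KiB, given a total in GiB. */
static double
sketch_kib_per_inode(double total_gib, double inodes)
{
	assert(inodes > 0);
	return (total_gib * 1024.0 * 1024.0 / inodes);
}

/* How many times smaller the "after" footprint is. */
static double
sketch_reduction_factor(double before, double after)
{
	assert(after > 0);
	return (before / after);
}

/* Percentage of space saved going from "before" to "after". */
static double
sketch_percent_saved(double before, double after)
{
	assert(before > 0);
	return (100.0 * (1.0 - after / before));
}
```

With the reported totals, 117.7 / 3.7 comes to roughly 31.8, in line with the quoted ~32x. For a single empty directory, taking the midpoints of the quoted ranges (1.5 KB vs. 115 KB) gives a saving of a little under 99%.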
TODO:
Types of changes
Checklist:
Signed-off-by.